Abstract

We present our experiences in using Java as an intermediate language for the high-level
programming language Nesl. First, we describe the design and implementation of a system for translating
VCODE — the  current  intermediate  language  used  by  Nesl — into  Java.  Second,  we  evaluate  this 
translation by comparing the performance of the original Vcode implementation with several
variants of the Java implementation. The translator was easy to build, and the generated Java code
achieves  reasonable  performance  when  using  a  just-in-time  compiler.  We  conclude  that  Java  is 
attractive  both  as  a  compilation  target  for  rapid  prototyping  of  new  programming  languages  and 
as  a  means  of  improving  the  portability  of  existing  programming  languages. 
1 Introduction


Intermediate  languages  are  used  by  many  modern  compilers.  Typically  they  are  produced  by  a 
compiler’s  front  end,  which  handles  parsing  and  error  checking  for  a  particular  high-level  language, 
and are consumed by the back end, which handles code generation for a particular machine
architecture. Intermediate languages simplify the inevitable process of porting a compiler to a new
architecture  by  enabling  the  developer  to  re-use  the  front  end  of  the  compiler.  Creating  a  new 
high-level  language  is  also  made  easier  if  an  existing  intermediate  language  is  used,  because  back 
ends  can  then  be  reused. 

Choosing  the  right  intermediate  language  is  an  important  decision  for  language  developers  and 
compiler  writers.  Ideally,  an  intermediate  language  should  be  simple  enough  to  serve  as  a  good 
compiler  target,  while  at  the  same  time  allowing  for  efficient  execution  on  a  range  of  platforms. 
Moreover,  it  should  isolate  low-level  issues  such  as  error  checking  and  memory  management.  This 
paper was inspired by the observation that the recently developed Java¹ programming language [14]
appears  to  possess  all  of  these  characteristics.  We  wanted  to  know  whether  Java  would  make  a 
good  intermediate  language  for  current  and  future  compilers. 

Java  has  several  attractions  as  an  intermediate  language.  The  first  is  the  design  of  the  language 
itself.  By  being  strongly-typed,  Java  makes  the  compiler  easier  to  debug,  because  many  code 
generator  bugs  will  trigger  type  errors  during  compilation  of  the  resulting  Java  code.  In  contrast,  if 
a  weakly-typed  intermediate  language  were  used,  these  bugs  might  not  be  found  until  runtime.  Java 
also  provides  garbage  collection,  thereby  removing  a  large  source  of  potential  memory-management 
bugs  in  the  generated  code.  In  addition,  if  the  language  being  implemented  must  itself  provide 
garbage  collection,  the  developer  can  simply  push  the  responsibility  down  to  Java. 

Another  attraction  of  Java  is  that  it  is  a  portable,  network-aware  language.  The  details  of 
machine  architecture,  operating  system,  and  display  environment  are  all  handled  transparently  by 
the  Java  virtual  machine  [15].  The  same  Java  program  can  run  on  a  Unix  workstation,  a  PC, 
and  a  Macintosh,  while  retaining  the  same  “look  and  feel”  on  each  platform.  Using  Java  as  an 
intermediate  language  also  allows  programs  to  be  distributed  in  an  executable  form  (Java  bytecode) 
over  the  Internet. 

Finally,  Java  is  a  highly  successful  commercial  product.  This  fact  has  huge  advantages  for 
a  language  developer.  In  particular,  there  is  a  whole  industry  devoted  to  porting  Java  to  new 
platforms,  improving  Java  compilers  and  run-time  systems,  writing  libraries  of  Java  objects,  and 
fixing  bugs.  The  developer  would  normally  have  to  do  all  of  these  chores  if  a  special-purpose 
intermediate  language  were  used. 

Given  all  these  advantages,  what  might  stop  us  from  using  Java  as  an  intermediate  language? 
There  are  three  main  questions  to  consider: 

•  Is  Java  easy  to  use  in  a  new  or  existing  system? 

•  Does  Java  provide  sufficient  functionality  to  model  the  features  of  the  source  language? 

•  Can  the  resulting  programs  be  efficiently  executed  by  a  Java  virtual  machine? 

To  try  to  answer  these  questions,  we  built  a  system  that  translates  VCODE  [5]  (a  specialized 
intermediate  language  for  the  high-level  parallel  language  Nesl  [4])  into  Java,  and  performed  a 
series  of  benchmarks  to  compare  this  new  implementation  with  the  original. 

¹Java is a trademark of Sun Microsystems, Inc. All other trademarks in this paper are the property of their
respective owners.

The rest of this paper is organized as follows. Section 2 gives an overview of Nesl, Vcode,
and the current Nesl system. Section 3 describes the translation of Vcode, its run-time system,
and  its  libraries  into  Java.  Section  4  discusses  our  experiences  in  building  the  system  and  outlines 
additional  optimizations  that  we  incorporated  into  the  final  version.  Section  5  presents  benchmark 
results,  and  Section  6  describes  related  projects.  Finally,  Section  7  summarizes  the  work  and  our 
conclusions. 


2  The  Nesl  System 


Java  [14]  has  been  defined  as  “a  simple,  object-oriented,  distributed,  interpreted,  robust,  secure, 
architecture-neutral,  portable,  high-performance,  multithreaded,  dynamic,  buzzword-compliant, 
general-purpose programming language”.²

In  the  same  spirit  of  buzzword-compliance,  Nesl  [3]  is  an  interactive,  high-level,  strongly-typed, 
applicative,  sequence-based,  portable,  nested  data-parallel  language.  The  primary  data  structure 
in  Nesl  is  the  sequence,  each  element  of  which  can  itself  be  a  sequence.  Parallelism  is  expressed 
in Nesl through an apply-to-each form over elements of sequences and through parallel operations
on  sequences. 

The  current  Nesl  system  consists  of  three  layers,  as  shown  in  Figure  1  (see  [7]  for  full  details 
of  the  system).  The  front  end  of  the  system  is  an  interactive  compiler  that  lets  users  enter  Nesl 
expressions and programs. Every Nesl expression is first compiled into an intermediate language
called Vcode [5]. The compiler then invokes a Vcode interpreter (either locally or on a remote
machine), passes it the Vcode via rcp or a distributed filesystem, and reads back the results. The
Vcode interpreter is the back end of the system; using Vcode as a portable intermediate language
allows the user to execute the same code transparently on different machines, ranging from a Unix
workstation  to  a  parallel  supercomputer. 
Figure 1: Components (boxes) and languages (lines) of the current Nesl system. Solid lines
represent  translation,  and  dotted  lines  represent  linkage  to  C  libraries  (rounded  boxes). 


The primary duties of the Nesl compiler are implementing high-level aspects of the Nesl
language (such as type checking and the removal of higher-order code) and converting operations on
arbitrarily nested sequences into operations on segmented vectors [2]. Although Nesl was designed
primarily to support efficient data-parallel programming, the high-level algorithmic nature of the
language also makes it ideal for teaching and prototyping algorithms [4].

²Sadly, “buzzword-compliant” is now missing from http://java.sun.com/, although the full definition remains
intact at several mirror sites.

The  middle  layer  of  the  system  consists  of  the  intermediate  language  VCODE  and  its  interpreter. 
A Vcode program manipulates a stack of strongly-typed vectors. Each vector contains an arbitrary
number  of  atomic  values  of  a  single  type;  Vcode  vectors  cannot  be  nested,  unlike  the  Nesl 
sequences  they  represent.  The  language  provides  a  set  of  vector  operations,  stack  manipulation 
instructions,  and  associated  control  and  memory  management  instructions.  The  main  tasks  of  the 
Vcode interpreter are managing the stack and vector memory efficiently, and implementing the
vector operations via calls to CVL. The extra overhead of interpreting Vcode instructions, rather
than  executing  a  compiled  version  of  them,  is  amortized  over  the  length  of  the  vectors  on  which 
they  operate.  Note  that  Vcode  shares  several  properties  with  Java  bytecode  [10]:  portability, 
strong  typing,  a  stack-based  execution  model,  and  a  design  allowing  for  easy  interpretation. 

At the bottom of the system is CVL (C Vector Library), a machine-specific library that
implements an abstract vector machine [6]. An example of a CVL function is add_wuz, which adds the
corresponding  elements  of  two  integer  vectors  together  and  returns  the  results  in  a  third  vector. 
CVL  is  the  only  part  of  the  system  that  must  be  rewritten  for  a  new  architecture  [11]. 

3 Implementing Vcode in Java

To  use  Java  as  an  intermediate  language  in  an  existing  compiler,  the  current  intermediate  language 
(assuming  that  one  exists)  either  can  be  totally  replaced  by  Java  or  it  can  be  translated  into  Java 
by  an  additional  stage  of  the  compilation  process.  The  first  approach  entails  rewriting  the  front  end 
of  the  compiler  to  generate  Java.  The  second  approach  requires  no  changes  to  existing  parts  of  the 
compiler.  Moreover,  the  additional  translation  stage  should  be  simple  to  implement  because  the 
current  intermediate  language  was  probably  designed  for  easy  compilation.  Finally,  the  front  end 
has  already  performed  most  of  the  error  checking,  and  has  reduced  complex  high-level  constructs 
of  the  source  language  into  more  primitive  operations  that  are  easier  to  map  into  Java. 

However,  the  second  approach  will  probably  generate  less  efficient  Java  code  than  the  first, 
because semantic information is lost. In particular, the code produced will be Java “in the style
of” the existing intermediate language, and probably will not take full advantage of Java language
features (in the same way that Fortran written in the style of Lisp is unlikely to be efficient). These
observations  are  not  specific  to  Java;  they  apply  to  any  intermediate  language. 

We chose the second approach, favoring ease of implementation over the efficiency of the
generated code (we investigate the final impact on performance in Section 5). The design of our system
is  shown  in  Figure  2.  The  Nesl  compiler  is  unchanged.  Below  it,  the  vcodetojava  phase  converts 
Vcode  into  Java.  A  standard  Java  compiler  is  then  used  to  compile  the  Java,  which  consists  mainly 
of  calls  to  vector  methods,  into  portable  Java  bytecode.  This,  together  with  a  VcodeEmulation 
library  that  implements  the  vector  methods  and  associated  vector  stack,  can  be  executed  by  any 
Java  virtual  machine.  Note  that  we  have  effectively  replaced  the  Vcode  interpreter  with  a  Java 
virtual  machine. 

In  the  rest  of  this  section  we  discuss  the  design  and  implementation  of  each  of  the  major 
components  of  this  system:  the  vector  stack,  the  vector  operations,  and  the  vcodetojava  program. 
Figure 2: Components (boxes) and languages (lines) of the new Nesl-to-Java system. Solid arrows
represent  translation,  and  dotted  lines  represent  linkage  to  Java  libraries.  New  components  are 
shown  with  heavy  borders. 


3.1  Emulating  the  Vcode  vector  stack 

The basic Vcode data structure is a vector of a primitive type. We represent this by a Java array:
a first-class object with a length attribute. Java also supplies a vector class, java.util.Vector,
which is essentially a dynamically-sized array capable of storing heterogeneous types. However,
because Vcode vectors are both homogeneous and of a fixed length once created, there is no need
for the extra generality and overhead of the Vector class. The Vcode types are mapped to Java
types  as  shown  in  Table  1. 


    Vcode    Java       Bits
    INT      int        32
    FLOAT    double     64
    BOOL     boolean    1
    CHAR     char       16

Table 1: Mapping of Vcode types to their Java equivalents.

The  Vcode  interpreter  allocates  space  for  vectors  when  they  are  created,  and  uses  reference 
counting  to  reclaim  the  space  when  they  are  no  longer  used.  In  Java,  we  create  arrays  corresponding 
to Vcode vectors with the new operator, and rely on them being reclaimed by the Java garbage
collector  when  they  are  no  longer  used. 

Vcode  operations  receive  all  their  arguments,  and  return  all  their  results,  via  the  vector  stack. 
In a typical implementation, the stack contains pointers to vectors rather than the vectors
themselves. This approach permits fast stack operations (especially when popping, copying, or moving
more than one element) and enables multiple copies of the same vector to be represented by
identical pointers to the same piece of data. In Java, we achieve the same effect by storing references
to arrays in an instance of the standard Java stack class, java.util.Stack.
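
The sharing of vectors through references can be illustrated with a short sketch. It is written in present-day Java (generics did not exist in JDK 1.0), and the class name VectorStackSketch and the method dupTop are hypothetical names chosen for this illustration rather than part of the system described here.

```java
import java.util.Stack;

// Hypothetical sketch: a vector stack that holds references to Java arrays.
class VectorStackSketch {
    // Each element is a reference to an int[], double[], boolean[], or char[].
    private final Stack<Object> stack = new Stack<>();

    void push(Object vector) { stack.push(vector); }

    Object pop() { return stack.pop(); }

    // Duplicating the top vector copies only the reference, not the data.
    void dupTop() { stack.push(stack.peek()); }

    public static void main(String[] args) {
        VectorStackSketch s = new VectorStackSketch();
        s.push(new double[] {1.0, 2.0, 3.0});   // stands in for a Vcode FLOAT vector
        s.dupTop();
        double[] copy = (double[]) s.pop();
        double[] original = (double[]) s.pop();
        System.out.println(copy == original);   // compares references, not contents
    }
}
```

Because only references are stored, duplicating a vector on the stack costs the same regardless of the vector's length.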

3.2  Implementing  VCODE  vector  operations  in  Java 

Vcode provides over 130 vector operations. These operations typically have a direct mapping
to functions provided by CVL. The Vcode interpreter runs a function-dispatch loop to execute
programs: fetch the next Vcode operation, decode it, pop the appropriate number of arguments
from the vector stack, call the matching CVL function, and push the result(s) back onto the stack.

For  portability  reasons  we  cannot  rely  on  a  machine-specific  library  such  as  CVL;  all  of  the  vector 
operations  must  be  implemented  as  Java  methods.  However,  the  task  of  writing  the  methods  is 
simplified  because  most  of  them  fall  into  one  of  three  major  groups,  with  the  code  for  operations 
in  each  group  being  very  similar. 

The  vector  methods  are  contained  in  the  VcodeEmulation  class;  this  also  holds  the  stack  where 
array  references  are  stored.  Intuitively,  this  class  implements  an  abstract  vector  stack  object  and  its 
associated vector operations. Just like their Vcode equivalents, the Java vector methods operate
on the stack itself, popping their arguments and pushing their results. Figure 3 shows a Java
method for the Vcode operation + INT, which adds together two integer vectors.

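    void AddI () {
        int[] a = (int[]) pop();               // pop the argument arrays
        int[] b = (int[]) pop();
        int[] dst = new int[a.length];         // create a result array
        for (int i = 0; i < a.length; i++) {   // loop over the elements...
            dst[i] = a[i] + b[i];              // ...adding them together
        }
        push(dst);                             // push the result onto the stack
    }

Figure 3: Java method to implement the Vcode operation + INT, which adds two integer vectors.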

Note  that  this  code  assumes  that  the  two  argument  arrays  a  and  b  have  the  same  length.  The 
+  INT  operation  makes  the  same  assumption,  but  the  Vcode  interpreter  can  also  check  for  vector 
length mismatches at runtime. In the Vcode-to-Java system, Java throws an exception if a runtime
length  mismatch  causes  the  shorter  array  bound  to  be  over-stepped.  For  full  protection,  we  could 
extend  the  method  to  throw  an  exception  immediately  if  the  two  lengths  are  not  equal. 
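
One possible form of such an extended method is sketched below. To keep the example self-contained it takes its arguments as parameters instead of popping them from the vector stack; the class name CheckedAdd and the choice of IllegalArgumentException are assumptions made for illustration. Without the explicit check, a second argument that is longer than the first would go unnoticed, because the loop runs only to a.length.

```java
import java.util.Arrays;

// Hypothetical variant of the integer vector addition that checks lengths up front.
class CheckedAdd {
    // Adds two integer vectors element by element, rejecting mismatched lengths.
    static int[] addI(int[] a, int[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException(
                "vector length mismatch: " + a.length + " vs " + b.length);
        }
        int[] dst = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            dst[i] = a[i] + b[i];
        }
        return dst;
    }

    public static void main(String[] args) {
        int[] sum = addI(new int[] {1, 2, 3}, new int[] {10, 20, 30});
        System.out.println(Arrays.toString(sum));
    }
}
```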

Vcode implements Nesl’s nesting of data structures efficiently by using segmented vectors [2].
Segmented vectors use two kinds of vectors to represent arbitrary sequence nesting: a normal
non-nested vector to hold the data, and a series of specialized vectors (called segment descriptors)
to describe how the data is subdivided. Many Vcode operations are defined only for segmented
vectors,  and  require  their  arguments  to  have  segment  descriptors.  We  chose  to  represent  a  segment 
descriptor  in  Java  as  an  array  of  integers  holding  the  individual  segment  lengths.  As  a  consequence, 
the  Java  implementation  of  a  segmented  operation  is  only  slightly  more  complex  than  that  of  its 
unsegmented  counterpart,  with  two  nested  loops  iterating  over  the  segments  and  the  elements 
within  each  segment.  Figure  4  shows  a  Java  method  for  the  Vcode  operation  +_REDUCE  FLOAT, 
a  segmented  add-reduce  (sum)  that  takes  as  arguments  a  segment  descriptor  and  a  floating-point 
data  vector.  The  result  is  a  vector  of  the  sums  of  each  of  the  segments. 
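
As an illustration of this representation, the following sketch flattens a singly nested sequence into a data vector and one segment descriptor. The class and method names are hypothetical, and deeper nesting (which would need one descriptor per level) is not handled.

```java
import java.util.Arrays;

// Hypothetical sketch: one level of nesting represented as a flat data
// vector plus a segment descriptor holding the segment lengths.
class SegmentedSketch {
    // Returns the lengths of the inner sequences (the segment descriptor).
    static int[] segmentDescriptor(double[][] nested) {
        int[] segd = new int[nested.length];
        for (int i = 0; i < nested.length; i++) {
            segd[i] = nested[i].length;
        }
        return segd;
    }

    // Concatenates the inner sequences into one non-nested data vector.
    static double[] flatten(double[][] nested) {
        int total = 0;
        for (double[] seg : nested) {
            total += seg.length;
        }
        double[] data = new double[total];
        int k = 0;
        for (double[] seg : nested) {
            for (double x : seg) {
                data[k++] = x;
            }
        }
        return data;
    }

    public static void main(String[] args) {
        double[][] nested = { {1.0, 2.0, 3.0}, {}, {4.0, 5.0} };
        System.out.println(Arrays.toString(segmentDescriptor(nested)));
        System.out.println(Arrays.toString(flatten(nested)));
    }
}
```

An empty inner sequence contributes a zero to the descriptor and nothing to the data vector, so a segmented sum over it yields 0.0.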


    final void AddReduceF () {
        int[] segd = (int[]) pop();                  // pop the segment descriptor
        double[] src = (double[]) pop();             // and the source array
        double[] dst = new double[segd.length];      // create a result array
        int k = 0;
        for (int i = 0; i < segd.length; i++) {      // loop over the segments...
            double sum = 0.0;                        // ...initializing a sum of
            for (int j = 0; j < segd[i]; j++) {
                sum += src[k++];                     // ...all values in a segment
            }
            dst[i] = sum;                            // ...and storing the sum
        }
        push(dst);                                   // push the result
    }

Figure 4: Java method to implement the segmented Vcode operation +_REDUCE FLOAT, which
sums the individual segments within a floating-point vector.


3.3 Translating Vcode into Java

Rather than translating Vcode into Java and then compiling and running the resulting Java code,
we could have implemented a Vcode interpreter in Java. There were two main reasons for rejecting
this  approach:  simplicity  and  efficiency.  Performing  the  translation  is  very  easy,  whereas  writing 
a  new  interpreter  would  have  been  a  much  more  ambitious  undertaking,  and  adding  an  additional 
level  of  interpretation  would  inevitably  slow  down  the  resulting  code. 

The  Java  program  generated  by  the  translation  process  defines  a  single  Java  class  of  the  same 
name as the original Vcode program. This class contains a Java method for each of the
user-defined Vcode functions. A Java compiler is then used to produce Java bytecode from this source
code. The bytecode is a portable executable representation of the program and can be run by any
Java virtual machine, just as the original Vcode program can be run by any Vcode interpreter.

A  simple  Perl  script  called  vcodetojava  performs  the  translation,  using  an  associative  array 
to  map  most  Vcode  operations  into  the  matching  Java  method  calls.  However,  a  few  operations 
require  extra  processing.  In  particular,  the  function-defining  operation  FUNC  is  mapped  to  the 
opening  of  a  new  user-defined  Java  method,  and  the  function-calling  operation  CALL  is  mapped 
to  a  Java  call  of  the  corresponding  user-defined  method.  Both  of  these  operations  also  require 
some name-demangling to ensure that Vcode function names remain legal in Java. The only other
Vcode control flow operation is the IF...ELSE construct. This construct is mapped into a Java
if...else block, where the if condition contains a method call that returns the boolean value on
top  of  the  vector  stack. 
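
The table-driven structure of the translation can be sketched as follows. The actual vcodetojava script is written in Perl; this is a Java approximation under stated assumptions: only operations that appear in the figures of this paper are in the table, RET is assumed to mark the end of a function, and name-demangling, CALL, and IF...ELSE are omitted.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical line-by-line translator in the spirit of vcodetojava.
class VcodeToJavaSketch {
    // Table from Vcode operations to Java method names; only the entries
    // that appear in the figures of this paper are listed.
    private static final Map<String, String> OPS = new HashMap<>();
    static {
        OPS.put("+ INT", "AddI");
        OPS.put("* FLOAT", "MultF");
        OPS.put("+_REDUCE FLOAT", "AddReduceF");
    }

    // Translates one line of Vcode into one line of Java source text.
    static String translateLine(String line) {
        String[] t = line.trim().split("\\s+");
        if (t[0].equals("FUNC")) {                 // opens a user-defined method
            return "private static void " + t[1] + " () {";
        }
        if (t[0].equals("RET")) {                  // assumed end-of-function marker
            return "}";
        }
        if (t[0].equals("POP") || t[0].equals("COPY")) {
            String name = t[0].equals("POP") ? "Pop" : "Copy";
            return "    s." + name + " (" + t[1] + "," + t[2] + ");";
        }
        String method = OPS.get(String.join(" ", t));
        if (method == null) {
            throw new IllegalArgumentException("unknown operation: " + line);
        }
        return "    s." + method + " ();";
    }

    public static void main(String[] args) {
        String[] vcode = { "FUNC DOTPRODUCT_47", "POP 1 1", "* FLOAT",
                           "COPY 1 1", "+_REDUCE FLOAT", "POP 1 1", "RET" };
        for (String line : vcode) {
            System.out.println(translateLine(line));
        }
    }
}
```

Keeping the common case as a table lookup means that supporting another vector operation only requires a new table entry, provided the matching method exists in the emulation class.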

Figure  5  shows  an  example  of  the  translation  process  for  a  dot-product  function,  which  multiplies 
the  elements  of  two  equal-length  sequences  together  and  returns  the  sum  of  the  products.  The  Nesl 
definition is a single line of code, and is compiled into a seven-line Vcode function. The Vcode
function takes two pairs of segment descriptors and data vectors as input, corresponding to the X
and Y sequences. The initial POP operation throws away a segment descriptor that is not needed by
the following unsegmented elementwise multiplication * FLOAT (in general, POP n m means “pop
n  elements  from  a  depth  of  m”).  The  COPY  operation  copies  a  segment  descriptor  to  the  top  of 
the  stack,  where  it  is  used  by  the  plus-reduce  (sum)  operation.  Finally,  another  POP  discards  the 
segment  descriptor  produced  by  the  plus-reduce  operation,  and  the  function  returns. 

Nesl:   function dotproduct (X, Y) = sum ({x * y: x in X; y in Y})

Vcode:  FUNC DOTPRODUCT_47
        POP 1 1
        * FLOAT
        COPY 1 1
        +_REDUCE FLOAT
        POP 1 1
        RET

Java:   private static void DOTPRODUCT_47 () {
            s.Pop (1,1);
            s.MultF ();
            s.Copy (1,1);
            s.AddReduceF ();
            s.Pop (1,1);
        }

Figure 5: Nesl, Vcode, and Java representations of a dot-product function.

The Java method at the bottom of Figure 5 is generated by vcodetojava. As can be seen, the
translation is very simple and can be applied on a line-by-line basis. Note that the Java method
calls  are  being  applied  to  the  object  s,  which  is  an  instance  of  the  VcodeEmulation  class. 
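
To show that the generated method does compute a dot product, the sketch below runs its body against a miniature stand-in for the VcodeEmulation class. Everything except the five calls taken from Figure 5 is an assumption made for this illustration: the class name MiniVcodeEmulation, the use of an ArrayList as the stack, and in particular the reading of depth in Pop and Copy (depth 0 is taken to be the top of the stack).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical miniature of the emulation class, just large enough to run
// the generated dot-product method. Depth 0 is assumed to be the stack top.
class MiniVcodeEmulation {
    private final List<Object> stack = new ArrayList<>();

    void push(Object v) { stack.add(v); }

    Object pop() { return stack.remove(stack.size() - 1); }

    // Removes n elements that lie m elements below the top.
    void Pop(int n, int m) {
        int from = stack.size() - m - n;
        for (int i = 0; i < n; i++) {
            stack.remove(from);
        }
    }

    // Copies n elements that lie m elements below the top onto the top.
    void Copy(int n, int m) {
        int from = stack.size() - m - n;
        for (int i = 0; i < n; i++) {
            stack.add(stack.get(from + i));
        }
    }

    // Elementwise product of two floating-point vectors.
    void MultF() {
        double[] b = (double[]) pop();
        double[] a = (double[]) pop();
        double[] dst = new double[a.length];
        for (int i = 0; i < a.length; i++) {
            dst[i] = a[i] * b[i];
        }
        push(dst);
    }

    // Segmented sum: one result per entry of the segment descriptor.
    void AddReduceF() {
        int[] segd = (int[]) pop();
        double[] src = (double[]) pop();
        double[] dst = new double[segd.length];
        int k = 0;
        for (int i = 0; i < segd.length; i++) {
            double sum = 0.0;
            for (int j = 0; j < segd[i]; j++) {
                sum += src[k++];
            }
            dst[i] = sum;
        }
        push(dst);
    }

    // Sets up the two (segment descriptor, data) pairs, then replays the
    // five calls of the generated method.
    static double[] dotProduct(double[] x, double[] y) {
        MiniVcodeEmulation s = new MiniVcodeEmulation();
        s.push(new int[] {x.length});
        s.push(x);
        s.push(new int[] {y.length});
        s.push(y);
        s.Pop(1, 1);
        s.MultF();
        s.Copy(1, 1);
        s.AddReduceF();
        s.Pop(1, 1);
        return (double[]) s.pop();
    }

    public static void main(String[] args) {
        double[] r = dotProduct(new double[] {1, 2, 3}, new double[] {4, 5, 6});
        System.out.println(r[0]);
    }
}
```

Under these assumptions the result is a one-element vector, which matches the Vcode convention that every value on the stack is a vector.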

The vcodetojava script consists of about 210 lines of Perl, of which 80 constitute the actual
algorithm  while  the  rest  initialize  the  associative  array.  On  a  Sun  SPARCstation  5/85  workstation, 
the  script  translates  the  13,500  lines  of  Vcode  corresponding  to  the  Nesl  test  suite  in  about  3.5 
seconds.  This  time  compares  to  about  19  seconds  for  the  Nesl  compiler  to  generate  the  Vcode, 
and  71  seconds  for  Sun’s  portable  compiler  javac  to  compile  the  Java  into  bytecode.  The  times  on 
a  low-end  PC  are  roughly  comparable,  although  a  faster  native  Java  compiler  can  be  used  in  place 
of Sun’s portable compiler. For example, Microsoft’s Visual J++ development environment takes
just  over  3  seconds  to  compile  the  same  Java  file  on  a  DX4-120  system. 

4  Pros  and  Cons  of  Java 

It  took  one  of  the  authors  just  over  two  days  to  complete  a  working  prototype  of  the  system,  with 
his  time  divided  roughly  equally  between  implementing  the  Java  stack  model  and  translation  script, 
and  writing  the  vector  methods.  From  this  standpoint,  the  experiment  was  clearly  a  success:  we 
quickly  had  a  working  system  that  enabled  us  to  execute  Nesl  code  on  any  Java  platform. 

In terms of language features, Java’s stack and array classes were a great help in rapid
prototyping, and the language’s built-in garbage collection meant that we did not have to adapt the
reference-counting code used by the Vcode interpreter. We did not exploit Java’s object-oriented
features; there is no inheritance, and very little composition of objects. In essence, we used
Java  as  a  portable  dialect  of  C  with  garbage  collection  and  a  good  collection  of  preexisting  data 
structures. 

Some aspects of Java slowed down both the development process and the generated code.
Without templates, parametric polymorphism, or a built-in preprocessor, it is impossible to generate
efficient type-specialized versions of the same basic method from within Java itself. Although
useful for prototyping, the standard Java stack class is limiting in that it allows manipulation only
of the element at the top of the stack, whereas Vcode requires the ability to operate on multiple
elements at arbitrary positions in the stack. In terms of runtime performance, creating Java arrays
is relatively expensive, because they are defined to be filled with the null value appropriate for
their type. This requirement causes an implicit loop over the array, even though the
initialization is unnecessary for arrays that will be written before being read. Finally, because Java is a
young  language,  there  is  little  performance  data  available  for  use  in  making  informed  design  and 
optimization  decisions. 

However,  it  is  easy  to  solve  or  work  around  most  of  these  problems.  We  could  have  used 
an  external  preprocessor  such  as  the  Unix  m4  tool  to  generate  multiple  type-specialized  versions  of 
each vector method.³ Microbenchmarks were used to establish the comparative cost of various Java
operations; the results can be found at http://www.cs.cmu.edu/~jch/java/optimization.html.
Profiling  revealed  that  the  awkward  coding  required  to  use  the  standard  Java  stack  class  was  also 
a  significant  performance  bottleneck.  We  therefore  created  a  new  version  of  the  VcodeEmulation 
class  that  represents  the  stack  as  a  directly-accessible  array  of  objects.  Each  element  of  the  stack 
holds an array corresponding to a Vcode vector, and the stack is grown as necessary; this is
essentially  a  specialized  version  of  the  standard  Java  vector  class.  This  modification  improved  the 
performance  of  the  stack-intensive  selection  benchmark  (see  Section  5)  by  approximately  30%,  and 
was  adopted  for  the  final  version  of  the  code. 
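
A stack of this kind might look roughly like the sketch below. The class name ArrayVectorStack, the initial capacity, and the doubling growth policy are assumptions; the text above states only that the stack is a directly accessible array that grows as necessary.

```java
// Hypothetical sketch of a vector stack stored in a directly accessible array.
class ArrayVectorStack {
    private Object[] slots = new Object[4];   // initial capacity is arbitrary
    private int size = 0;

    void push(Object vector) {
        if (size == slots.length) {            // grow the stack as necessary
            Object[] bigger = new Object[2 * slots.length];
            System.arraycopy(slots, 0, bigger, 0, size);
            slots = bigger;
        }
        slots[size++] = vector;
    }

    Object pop() {
        Object v = slots[--size];
        slots[size] = null;                    // drop the reference so it can be collected
        return v;
    }

    // Direct access to the element `depth` positions below the top (0 = top).
    Object peek(int depth) {
        return slots[size - 1 - depth];
    }

    // Removes n elements lying m positions below the top with one block move.
    void pop(int n, int m) {
        int from = size - m - n;
        System.arraycopy(slots, from + n, slots, from, m);
        for (int i = size - n; i < size; i++) {
            slots[i] = null;
        }
        size -= n;
    }

    int size() { return size; }

    public static void main(String[] args) {
        ArrayVectorStack s = new ArrayVectorStack();
        for (int i = 0; i < 5; i++) {
            s.push(new int[] {i});             // the fifth push triggers growth
        }
        s.pop(1, 1);                           // remove the element just below the top
        System.out.println(((int[]) s.peek(0))[0] + " " + s.size());
    }
}
```

Removing elements from the middle of the stack becomes a single block move here, whereas a stack that exposes only its top element would need a sequence of pops and pushes.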

5  Benchmarking  the  System 

There  are  three  performance  characteristics  of  Java  that  we  wish  to  understand:  the  effectiveness 
of  just-in-time  compilation,  the  cost  of  portability,  and  the  overall  system  speed.  We  therefore 
benchmarked four different implementations of Vcode:

1.  JDK:  The  Java  interpreter  from  Sun’s  Java  Development  Kit  (JDK),  using  the  VcodeEmulation 
class  described  in  Section  3. 

2.  JIT: A just-in-time (JIT) Java compiler, also using the VcodeEmulation class. The
just-in-time compiler compiles Java bytecodes into machine code as it interprets them for the first
time,  and  then  stores  the  machine  code  for  future  reuse. 

3.  Native:  The  JDK  interpreter,  using  the  VcodeNative  class.  This  class  is  a  replacement  for  the 
VcodeEmulation  class,  and  uses  native  C  functions  similar  to  those  in  CVL  to  implement  the 
vector  methods.  However,  it  still  uses  Java  for  the  vector  stack  code  and  for  array  allocation, 
because  we  want  these  objects  to  be  accessible  to  Java’s  garbage  collector. 

4.  Vinterp: The existing Vcode interpreter vinterp, written in C and linked against a serial
version of the CVL library. This combination has been tuned for asymptotic performance
on large vectors, with hand-unrolled loops and a memory management mechanism designed
specifically for Vcode.

Comparing the performance of the JDK interpreter and JIT compiler implementations helps us
understand the effectiveness of JIT compiler technology. Comparing the JDK and JIT
implementations with the native-methods implementation, which uses compiled C code, lets us study the
price of portability. Finally, benchmarking the existing Vcode interpreter enables full end-to-end
performance comparisons.

³In the same way, many CVL implementations use function-defining macros to generate typed versions of an
untyped function body.

We used three different Nesl benchmarks to compare the different implementations of Vcode:

•  Least-squares  line-fit.  This  function  finds  the  best  fit  to  a  sequence  of  points.  It  is  simple 
straight-line  code  with  no  conditionals  or  loops. 

•  Selection  (generalized  median-finding).  This  function  uses  a  recursive  randomized  algorithm 
to  find  the  element  in  a  vector  that  would  be  at  a  specified  position  if  the  vector  were  sorted. 

•  Sparse matrix-vector multiplication. This function multiplies a sparse matrix stored in
compressed row format by a dense vector, using a nested data-parallel algorithm.

We give the source code and test data for the benchmarks in Appendix A. Timings for
supercomputer platforms have previously been reported [7, 11]. All three benchmarks have asymptotic
running  times  that  are  linear  in  the  size  of  the  problem. 
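
For readers unfamiliar with compressed row format, a sequential Java sketch of the computation in the third benchmark follows. It is not the Nesl source from Appendix A; the class name, the parameter names, and the use of per-row lengths (in the manner of a segment descriptor) are assumptions. The work done is proportional to the number of stored nonzeros, which is consistent with the linear running time noted above.

```java
import java.util.Arrays;

// Hypothetical sequential sketch of a compressed-row sparse matrix-vector product.
class SparseMatVecSketch {
    // rowLengths[i] is the number of stored nonzeros in row i; values and
    // columns hold those nonzeros and their column indices, row by row.
    static double[] multiply(int[] rowLengths, int[] columns,
                             double[] values, double[] x) {
        double[] y = new double[rowLengths.length];
        int k = 0;
        for (int i = 0; i < rowLengths.length; i++) {
            double sum = 0.0;
            for (int j = 0; j < rowLengths[i]; j++) {
                sum += values[k] * x[columns[k]];
                k++;
            }
            y[i] = sum;
        }
        return y;
    }

    public static void main(String[] args) {
        // The 3x3 matrix [[2,0,0],[0,0,3],[1,0,4]] in compressed row form.
        int[] rowLengths = {1, 1, 2};
        int[] columns = {0, 2, 0, 2};
        double[] values = {2.0, 3.0, 1.0, 4.0};
        double[] x = {1.0, 2.0, 3.0};
        System.out.println(Arrays.toString(multiply(rowLengths, columns, values, x)));
    }
}
```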

5.1  Methodology 

To  try  to  expose  any  performance  effects  that  could  be  due  to  machine  architecture  rather  than  to 
the  code  being  tested,  we  used  two  different  machines  for  benchmarking:  a  Sun  SPARCstation  5/85 
with  an  85  MHz  MicroSPARC2  processor  and  32  MB  of  RAM,  running  Solaris  2.4;  and  a  PC  with 
a  120  MHz  AMD  486  processor  and  24  MB  of  RAM,  running  Windows  95.  For  compilation 
we normally used the GNU C Compiler 2.7.0 (gcc -O2) and the Java compiler from Sun’s Java
Development Kit (JDK) 1.0.2 (javac -O). However, for the PC native code benchmarks we used
a  third-party  port  of  JDK  1.0.1,  running  on  Linux  1.2.13.  For  the  just-in-time  compiler  on  the  PC 
we  used  the  JIT  in  Microsoft  Internet  Explorer  3.0  Beta  2.  We  also  tried  Netscape  3.0b5,  but  its 
JIT  proved  to  have  a  much  higher  overhead  and  a  slightly  higher  per-element  cost  on  all  of  the 
benchmarks;  Internet  Explorer  was  18-38%  faster  for  the  largest  benchmark  sizes.  No  JIT  compiler 
was  available  for  the  SPARCstation. 

We  compiled  the  Nesl  source  code  into  Vcode  using  version  3.1  of  the  Nesl  compiler  [3], 
combined with an additional optimization phase that inlines Vcode functions and removes
unnecessary stack operations. All benchmarks were performed on idle machines to minimize outside
effects. This was particularly important for the Java benchmarks, because Java provides only a
time-of-day clock (java.lang.System.currentTimeMillis()), rather than a per-process timer.
The  poor  resolution  of  the  PC  clock  also  created  problems.  To  obtain  accurate  timings  of  the 
benchmarks  at  small  problem  sizes,  we  timed  multiple  iterations  of  each  benchmark,  adjusting 
iteration  counts  so  that  each  run  took  at  least  a  second. 

We  ran  the  Java  virtual  machines  with  their  default  heap  sizes,  which  resulted  in  some  garbage 
collection  taking  place  for  all  but  the  smallest  of  runs.  To  reduce  these  nondeterministic  memory 
effects,  we  forced  a  Java  garbage  collection  before  the  beginning  of  each  timing  run.  This  reduced 
the  variance  but  did  not  eliminate  it. 
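
The timing procedure described above can be sketched in present-day Java as follows. This is a minimal illustration, not the harness used for the measurements: the class name TimingSketch, the method millisPerIteration, and the doubling strategy for choosing the iteration count are all assumptions, and the lambda syntax is far newer than JDK 1.0.2.

```java
/** Hypothetical timing harness; all names here are illustrative. */
final class TimingSketch {
    // Upper bound on the iteration count, so the doubling loop always terminates.
    private static final long MAX_ITERATIONS = 1L << 40;

    /**
     * Runs task repeatedly and returns the average wall-clock time per
     * iteration in milliseconds. The iteration count is doubled until one
     * timed run lasts at least minTotalMillis.
     */
    static double millisPerIteration(Runnable task, long minTotalMillis) {
        long iterations = 1;
        while (true) {
            System.gc(); // request a collection before the timed run (only a hint to the VM)
            long start = System.currentTimeMillis(); // time-of-day clock
            for (long i = 0; i < iterations; i++) {
                task.run();
            }
            long elapsed = System.currentTimeMillis() - start;
            if (elapsed >= minTotalMillis || iterations >= MAX_ITERATIONS) {
                return (double) elapsed / iterations;
            }
            iterations *= 2; // run was too short for the clock resolution
        }
    }

    public static void main(String[] args) {
        double[] v = new double[1 << 10];
        Runnable scale = () -> {
            for (int i = 0; i < v.length; i++) {
                v[i] = v[i] * 0.5 + 1.0;
            }
        };
        // Require each timed run to last at least one second.
        System.out.println("ms per iteration: " + millisPerIteration(scale, 1000));
    }
}
```

Because the clock measures elapsed time rather than per-process time, a harness of this kind is only meaningful on an otherwise idle machine.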

5.2  Results 

We timed each of the benchmarks over a range of problem sizes up to $2^{17}$ (131072). Table 2 gives
timings  for  selected  problem  sizes,  averaged  over  five  runs  and  rounded  to  two  significant  figures. 
We’ll  analyze  the  results  for  the  line-fit  benchmark  in  depth,  and  then  briefly  discuss  the  results 
for  the  selection  and  sparse  matrix-vector  multiplication  benchmarks. 

Problem            120 MHz 486 PC                SPARCstation 5/85
size          JDK      JIT   Native  Vinterp      JDK   Native  Vinterp

Line-fit
      16      9.1      2.6      5.8     0.73      9.5      7.0     0.62
     128       17      4.3      7.1      1.3       16      8.0     0.85
    1024       84       14       16      6.6       70       15      5.4
    8192      580       99       90       50      500       71       40
   65536     4800      800      570      460     4000      480      310

Selection
      16       14      5.4      8.8      1.3       15       11      1.3
     128       25      8.2       14      2.1       25       16      2.0
    1024       74       18       25      4.7       70       29      4.3
    8192      350       52       48       19      320       52       21
   65536     2600      280      200      130     2300      190      160

Sparse matrix-vector multiplication
      16      2.0     0.62      1.2     0.17      2.0      1.4     0.14
     128      3.9      1.1      1.4     0.28      4.1      1.8     0.20
    1024       18      3.9      2.9      1.4       17      2.9     0.95
    8192      135       25       15       12      120       13      8.2
   65536     1100      200      120      110      940       97       68

Table  2:  Running  times  in  milliseconds  for  three  Nesl  benchmarks  using  different  intermediate 
language  implementations  on  a  120  MHz  486  PC  and  a  SPARCstation  5/85. 


5.2.1  Line-fit  Benchmark 

Figure  6  shows  the  performance  (in  elements  per  second)  achieved  by  the  four  implementations  on 
the  line-fit  benchmark,  with  the  minimum,  average,  and  maximum  times  plotted  for  each  point. 
The  error  bars  are  only  noticeable  for  the  JIT  and  native  code  implementations  at  large  problem 
sizes,  where  there  appeared  to  be  significant  variations  in  the  time  taken  by  garbage  collection 
during  a  run.  We  treat  the  current  VCODE  interpreter  (vinterp)  implementation  as  the  base 
case,  and  discuss  the  results  for  each  of  the  Java  implementations  (native,  JIT,  and  JDK)  in  turn, 
concentrating  on  the  results  for  the  PC  platform. 

The VCODE interpreter is clearly the fastest of the four implementations of VCODE tested here,
as  we  would  expect  for  a  special-purpose  implementation.  Its  relative  performance  compared  to 
the  other  implementations  is  best  at  small  vector  sizes,  and  its  absolute  performance  falls  off  after 
about  16,000  elements,  when  the  PC’s  256  kB  L2  cache  can  no  longer  hold  two  double-precision 
floating-point  vectors. 

The native methods implementation approaches the performance of the VCODE interpreter for
large problem sizes, but does not quite reach it. There are two probable causes for this performance
difference. First, the VCODE interpreter is linked to a CVL library whose loops have been unrolled
by hand, reducing the loop overhead. Second, the Java requirement that every element of an
array be initialized when the array is allocated causes an extra loop over the data that CVL
does  not  need  to  perform.  Note  that  there  is  no  clearly  observable  cache  effect  for  the  native 
methods  implementation,  possibly  because  it  is  masked  by  the  additional  memory  activity  of  the 
Java  interpreter. 

Figure 6: Performance of Nesl line-fit benchmark using different intermediate language implementations
on a 120 MHz 486 PC (left) and a SPARCstation 5/85 (right).

The JIT compiler achieves approximately half the performance of the VCODE interpreter for
problems bigger than about 1,000 elements. The additional slowdown compared to the native
methods implementation is probably due to the requirement that every array operation in Java
must check for valid indices. The JIT compiler must therefore generate extra conditionals in the
inner loop of vector code. There are techniques for guaranteeing valid indices without requiring
these extra conditionals, such as performing loop-bounds analysis or exploiting virtual memory
mechanisms for protection purposes [1], but to our knowledge these optimizations are not performed
by any current JIT compiler. Note that we were unable to measure any extra compilation overhead
incurred by the JIT compiler; this null result can probably be attributed entirely to the poor
resolution of the PC clock.
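
The cost of index checking can be made concrete with a small sketch. The class below is illustrative and is not taken from VcodeEmulation: add is the loop as it would be written in Java source, addWithExplicitChecks spells out the per-iteration tests that the language semantics imply, and addCheckedOnce validates the lengths a single time before the loop, which is the fact a loop-bounds analysis would have to establish before the per-iteration tests could be dropped.

```java
/** Illustrative only; the class and method names are assumptions. */
final class BoundsCheckSketch {
    /** Elementwise add as written in source; the VM checks every index. */
    static void add(double[] dst, double[] a, double[] b) {
        for (int i = 0; i < dst.length; i++) {
            dst[i] = a[i] + b[i];
        }
    }

    /** The same loop with the implied index tests written out by hand. */
    static void addWithExplicitChecks(double[] dst, double[] a, double[] b) {
        for (int i = 0; i < dst.length; i++) {
            if (i >= a.length || i >= b.length) {
                throw new ArrayIndexOutOfBoundsException(i);
            }
            dst[i] = a[i] + b[i];
        }
    }

    /** Lengths are validated once, so every index in the loop is known to be valid. */
    static void addCheckedOnce(double[] dst, double[] a, double[] b) {
        if (a.length < dst.length || b.length < dst.length) {
            throw new IllegalArgumentException("operand shorter than destination");
        }
        for (int i = 0; i < dst.length; i++) {
            dst[i] = a[i] + b[i];
        }
    }
}
```

All three methods compute the same result for well-formed inputs; they differ only in where the validity test is performed.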

Finally, the JDK interpreter achieves about one tenth the performance of the VCODE interpreter,
and  is  four  to  six  times  slower  than  the  JIT  compiler. 

As  well  as  graphing  the  results  of  the  line-fit  benchmark,  we  can  also  use  them  to  calculate 
the  constant  overhead  and  the  asymptotic  time  per  element  of  each  implementation.  This  is  only 
possible because the benchmark executes a fixed number of VCODE operations (and hence should
have a fixed interpretive overhead) for all problem sizes. The resulting figures for the constant
overhead  and  the  time  per  element  are  shown  in  Table  3. 


                    120 MHz 486 PC        SPARCstation 5/85
Implementation    Overhead  Per-elt.    Overhead  Per-elt.
JDK                   8200        72        8500        60
JIT                   2200        12         N/A       N/A
Native                5700       8.6        6900       7.1
Vinterp                690       7.1         610       4.7

Table  3:  Constant  overhead  and  asymptotic  time  per  element  (in  microseconds)  for  Nesl  line-fit 
benchmark  using  different  intermediate  language  implementations  on  a  PC  and  a  SPARCstation. 

Using  the  overhead  and  the  asymptotic  time  per  element  we  can  now  understand  why  the 
performance  curves  for  the  JIT  compiler  and  the  native  methods  implementation  cross.  The  JDK 
interpreter  used  by  the  native  methods  implementation  has  a  higher  constant  overhead  than  the 
JIT  compiler,  and  this  overhead  dominates  performance  for  small  problem  sizes.  However,  for  large 
problem  sizes  the  per-element  speed  advantage  of  the  native  vector  methods  outweighs  the  higher 
overhead  of  the  JDK  interpreter,  and  the  native  methods  implementation  is  faster. 
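
The crossing point can be estimated directly from Table 3 if we assume the simple linear model T(n) = overhead + per-element time × n for each implementation. The sketch below is illustrative (the class CostModel and its method names are assumptions). With the PC figures it predicts a crossover near 1,030 elements, which is consistent with Table 2, where the JIT compiler is faster at 1024 elements and the native methods implementation is faster at 8192.

```java
/** Linear cost model T(n) = overhead + perElement * n, with times in microseconds. */
final class CostModel {
    final double overhead;
    final double perElement;

    CostModel(double overhead, double perElement) {
        this.overhead = overhead;
        this.perElement = perElement;
    }

    /** Predicted running time in microseconds for a problem of size n. */
    double micros(double n) {
        return overhead + perElement * n;
    }

    /** Problem size at which the two models predict the same running time. */
    static double crossover(CostModel a, CostModel b) {
        return (b.overhead - a.overhead) / (a.perElement - b.perElement);
    }

    public static void main(String[] args) {
        CostModel jit = new CostModel(2200, 12);         // Table 3, PC, JIT
        CostModel nativeImpl = new CostModel(5700, 8.6); // Table 3, PC, Native
        System.out.println("crossover at n = " + crossover(jit, nativeImpl));
    }
}
```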

We can also calculate the percentage of the total running time due to the constant overhead,
as shown in Figure 7. The disparity between the speed of native methods and the speed of the
Java interpreter is reflected in the top curve; because the native methods are much faster than the
interpreter, the impact of the fixed interpretive overhead is bigger. The JDK and JIT compiler
implementations have similar results to that of the VCODE interpreter, indicating that the ratios
of their interpretation and execution costs are comparable, and the problem size at which they
achieve half of their peak performance ($n_{1/2}$) is similar. Put another way, the performance difference
between the specialized VCODE interpreter and the general-purpose Java interpreter and JIT
compiler is roughly the same as the performance difference between the machine-specific C code in
CVL and the Java vector methods in VcodeEmulation being executed by the Java interpreter and
JIT compiler.
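
Under the same linear model, T(n) = overhead + per-element time × n, the share of running time due to the constant overhead is overhead / T(n), and half of peak performance is reached at $n_{1/2}$ = overhead / per-element time. The sketch below is illustrative (the class OverheadShare and its method names are assumptions) and uses the PC figures from Table 3. It gives an $n_{1/2}$ of roughly 100 to 180 elements for the JDK, the JIT compiler, and the VCODE interpreter, and about 660 elements for the native methods implementation.

```java
/** Illustrative helper for the overhead share and n_1/2 under a linear cost model. */
final class OverheadShare {
    /** Fraction of total time spent in the constant overhead at problem size n. */
    static double fraction(double overhead, double perElement, double n) {
        return overhead / (overhead + perElement * n);
    }

    /** Problem size at which half of peak performance is reached. */
    static double nHalf(double overhead, double perElement) {
        return overhead / perElement;
    }

    public static void main(String[] args) {
        String[] names = {"JDK", "JIT", "Native", "Vinterp"};
        // {overhead, per-element time} in microseconds, from Table 3 (PC).
        double[][] pc = {{8200, 72}, {2200, 12}, {5700, 8.6}, {690, 7.1}};
        for (int i = 0; i < names.length; i++) {
            System.out.printf("%-8s n_1/2 = %.0f, overhead share at n = 1024: %.0f%%%n",
                    names[i],
                    nHalf(pc[i][0], pc[i][1]),
                    100.0 * fraction(pc[i][0], pc[i][1], 1024));
        }
    }
}
```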


Figure 7: Percentage of interpretive overhead in Nesl line-fit benchmark for different intermediate
language implementations on a 120 MHz 486 PC (left) and a SPARCstation 5/85 (right).

The  shapes  of  the  performance  curves  on  the  SPARCstation  are  generally  similar  to  those  on 
the PC, although the cache effect for the VCODE interpreter is much more pronounced and happens
at  around  500  elements,  due  to  the  SPARCstation’s  much  smaller  cache.  This  general  similarity 
of  the  results  for  the  two  platforms  is  true  for  all  three  benchmarks,  suggesting  that  there  are  no 
architecture-dependent  effects  skewing  the  results.  The  two  platforms  are  also  comparable  in  terms 
of  absolute  speed. 

5.2.2  Selection  Benchmark 

Figure  8  shows  the  performance  achieved  for  the  selection  benchmark.  The  ordering  of  the  results 
is the same as for the line-fit benchmark: for small problem sizes, the VCODE interpreter is fastest,
followed by the JIT compiler, native methods, and finally the JDK; for large problem sizes, the
ordering of the JIT compiler and the native methods implementation is reversed. However, the
shapes of the curves are different from those for the line-fit benchmark, reflecting the fact that the
selection  benchmark  spends  less  time  in  straight-line  code  than  line-fit,  and  places  more  emphasis 
on  recursion  and  dynamic  memory  use.  In  particular,  the  performance  curves  of  the  JIT  compiler 
and  the  native  methods  implementation  cross  at  a  larger  problem  size,  because  there  is  more  work 
going  on  in  the  Java  virtual  machine  (where  the  JIT  compiler  is  faster)  and  less  in  the  vector 
methods  (where  the  native  methods  are  faster). 

Figure 8: Performance of Nesl selection benchmark using different intermediate language
implementations on a 120 MHz 486 PC (left) and a SPARCstation 5/85 (right).

5.2.3  Sparse  matrix-vector  multiplication  benchmark 

Figure 9 shows the performance achieved on the final benchmark, sparse matrix-vector multiplication.
The ordering of the results is the same as for the previous two benchmarks. Even though
this is a nested data-parallel algorithm that uses segmented VCODE operations, the shapes of the
graphs and the performance ratios are similar to those for the non-nested line-fit benchmark, which
uses mostly unsegmented operations. Note that there is less variance in the results than for line-fit
because sparse matrix-vector multiplication uses fewer temporary vectors, and hence less garbage
collection occurs.


Figure 9: Performance of Nesl sparse matrix-vector multiplication benchmark using different intermediate
language implementations on a 120 MHz 486 PC (left) and a SPARCstation 5/85 (right).


5.3  Memory  Usage 

The space efficiency of an intermediate language is often as important as its time efficiency.
The Java VcodeEmulation class and the existing CVL implementation use essentially the same data
types, and so their memory usage per vector is similar. For example, a Java integer array of length
n occupies 4n + 16 bytes in the Sun JDK, compared to 4n bytes in a typical C implementation.
However, the dynamic memory usage of the VCODE and Java interpreters differs. The VCODE
interpreter is optimized for the case of a few big objects (vectors), whereas Java’s general-purpose
memory allocation mechanism is optimized for many small objects. In particular, the VCODE
interpreter uses reference counting to determine when a vector is no longer used, and hence can
reclaim its space immediately. The interpreter has to halt and compact vector memory only when
it can no longer find a free fragment large enough to satisfy a request. By comparison, current Java
virtual machines typically perform garbage collection only when the system is idle, when there is
no longer enough free memory, or on demand.

As an example, the VCODE interpreter requires just over 1.75 MB of heap to run the line-fit
benchmark on input vectors of length $2^{16}$ (65536) without performing memory compaction. A
double-precision floating-point vector of this length requires 0.5 MB of memory, so the VCODE
interpreter is storing at most three of these vectors at any one time. The JDK Java interpreter
using the original VcodeEmulation class requires 8.5 MB of heap to run the same benchmark
without triggering a garbage collection, because it cannot implicitly reclaim the space used by the
temporary vectors that the benchmark generates. We therefore extended VcodeEmulation to reuse
the last vector popped from the top of the stack whenever possible (i.e., whenever the vector is of
the right length and type to be used for a result). This modification reduces the minimum Java
heap size required to run the benchmark without garbage collection to 3.5 MB, and all benchmark
results are for this modified version of VcodeEmulation. A full reference-counting algorithm similar
to that employed by the VCODE interpreter would probably reduce the memory usage still further.
This would effectively be implementing a second level of garbage collection specialized for our
particular language, which has rather unusual memory usage characteristics.
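
The reuse of the last popped vector can be sketched as follows. This is a minimal illustration under stated assumptions, not the VcodeEmulation source: the class VectorStackSketch and its methods are hypothetical, only double vectors are handled (so the type test reduces to a length test), and the sketch assumes that no other stack entry aliases the popped array.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical vector stack that recycles the most recently popped vector. */
final class VectorStackSketch {
    private final Deque<double[]> stack = new ArrayDeque<>();
    private double[] lastPopped;   // candidate for reuse as a result buffer
    int freshAllocations = 0;      // counts result vectors that had to be allocated

    void push(double[] v) {
        stack.push(v);
    }

    double[] pop() {
        lastPopped = stack.pop();
        return lastPopped;
    }

    /** Returns a buffer for a result, reusing the last popped vector if its length fits. */
    double[] resultVector(int length) {
        if (lastPopped != null && lastPopped.length == length) {
            double[] reused = lastPopped;
            lastPopped = null;
            return reused;
        }
        freshAllocations++;
        return new double[length];
    }

    /** Elementwise add: pops two operands and pushes their sum. */
    void add() {
        double[] b = pop();
        double[] a = pop();                  // a is now the last vector popped
        double[] r = resultVector(a.length); // normally the same array as a
        for (int i = 0; i < r.length; i++) {
            r[i] = a[i] + b[i];              // safe in place: element i depends only on index i
        }
        push(r);
    }
}
```

In-place reuse is only valid for operations whose result element i depends solely on operand elements at index i; an operation such as a permutation would need a separate result buffer.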

6  Related  Work 

Several  other  projects  are  also  using  Java  to  implement  high-level  languages,  encompassing  a  wide 
range  of  design  approaches.  Perhaps  the  highest  level  is  Libero  [12],  which  compiles  a  program 
expressed  in  the  form  of  a  finite  state  machine  into  one  of  a  variety  of  target  languages,  including 
Java.  NetRexx  [9]  is  a  dialect  of  the  Rexx  language  that  also  compiles  to  Java.  Both  of  these 
projects  take  an  approach  similar  to  that  described  in  this  paper,  in  that  Java  source  code  is 
generated.  More  directly,  Intermetrics  Inc.  have  adapted  their  Ada  95  compiler  [16]  to  generate 
Java  bytecode,  dispensing  with  the  intermediate  step  of  using  a  Java  compiler  such  as  Sun’s  javac. 
The  Kawa  Scheme-in-Java  compiler  [8]  also  generates  Java  bytecode,  enabling  the  compiler  to 
perform  tail-recursion  elimination  using  the  GOTO  bytecode  instruction  (there  is  no  corresponding 
goto  statement  in  the  Java  language).  Finally,  Hot  TEA  [13]  implements  a  simple  Basic  interpreter 
on  top  of  the  Java  interpreter,  without  requiring  any  compilation  at  all. 

7  Conclusions 

Ideally, an intermediate language should be simple, portable, efficient, and (when possible) maintained
by somebody else. In this paper we have investigated whether Java makes a good intermediate
language. Specifically, we have described the design, implementation, and benchmarking of a
system that uses Java as an intermediate language for the high-level parallel language Nesl. Java
proved to be very easy to use, as demonstrated by the completion of the prototype in a weekend, and
had enough functionality to allow a clean implementation of the system. After additional tuning to
improve the speed and space efficiency of the generated code, a just-in-time Java compiler achieved
a performance between two and four times slower than that of the existing implementation of Nesl
(which uses hand-tuned C code) on a set of vector algorithm benchmarks. This performance gap is
likely to narrow as just-in-time compilation technology improves. We conclude that Java is a strong
candidate for use as an intermediate language for rapid prototyping of new high-level languages,
and for increasing the portability of existing languages.

A Benchmark Code

This  section  contains  the  Nesl  source  code  for  the  line-fit,  selection  and  sparse  matrix-vector 
multiplication  routines,  and  also  describes  the  test  data  used  for  the  benchmarks. 

A.1 Line-fit

function linefit(x, y) =
let
    n    = float(#x);
    xa   = sum(x)/n;
    ya   = sum(y)/n;
    Stt  = sum({(x - xa)^2: x});
    b    = sum({(x - xa) * y: x; y}) / Stt;
    a    = ya - xa*b;
    chi2 = sum({(y - a - b * x)^2: x; y});
    siga = sqrt((1.0 / n + xa^2 / Stt) * chi2 / n);
    sigb = sqrt((1.0 / Stt) * chi2 / n)
in
    (a, b, siga, sigb);

For the line-fit benchmarks, x and y were both copies of the floating-point index vector [0.0, 1.0, 2.0, ...].
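
For readers unfamiliar with Nesl, the same computation can be written as a sequential Java method. This is a hand-written sketch for comparison only, not output of the translator described in this paper; the class name LineFitSketch is an assumption, and the elementwise vector operations have been fused into explicit loops.

```java
/** Sequential Java rendering of the Nesl linefit function (illustrative only). */
final class LineFitSketch {
    /** Returns {a, b, siga, sigb} for the least-squares line y = a + b*x. */
    static double[] linefit(double[] x, double[] y) {
        double n = x.length;
        double xa = sum(x) / n;
        double ya = sum(y) / n;

        double stt = 0.0; // sum of (x - xa)^2
        double sxy = 0.0; // sum of (x - xa) * y
        for (int i = 0; i < x.length; i++) {
            double dx = x[i] - xa;
            stt += dx * dx;
            sxy += dx * y[i];
        }
        double b = sxy / stt;
        double a = ya - xa * b;

        double chi2 = 0.0; // sum of squared residuals
        for (int i = 0; i < x.length; i++) {
            double r = y[i] - a - b * x[i];
            chi2 += r * r;
        }
        double siga = Math.sqrt((1.0 / n + xa * xa / stt) * chi2 / n);
        double sigb = Math.sqrt((1.0 / stt) * chi2 / n);
        return new double[] {a, b, siga, sigb};
    }

    private static double sum(double[] v) {
        double s = 0.0;
        for (double e : v) {
            s += e;
        }
        return s;
    }
}
```

With the benchmark input, where x and y are the same index vector, the fitted line has intercept 0 and slope 1, and both error estimates are 0.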


A.2  Selection 

function select_kth(s, k) =
let pivot = s[#s/2];
    les   = {e in s | e < pivot}
in
    if (k < #les) then
        select_kth(les, k)
    else
        let grt = {e in s | e > pivot}
        in if (k >= #s - #grt) then
               select_kth(grt, k - (#s - #grt))
           else pivot;

For the selection benchmarks, s was the integer index vector [0, 1, 2, ...], and k was one third of
the length of s.
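
A sequential Java rendering of the same algorithm is sketched below for comparison. It is hand-written rather than produced by the translator, the class name SelectSketch is an assumption, and it uses the stream API of recent Java versions to express the two filters.

```java
import java.util.Arrays;

/** Sequential Java rendering of the Nesl select_kth function (illustrative only). */
final class SelectSketch {
    /** Returns the element of rank k (counting from 0) in s. */
    static int selectKth(int[] s, int k) {
        int pivot = s[s.length / 2];
        // Elements strictly less than the pivot.
        int[] les = Arrays.stream(s).filter(e -> e < pivot).toArray();
        if (k < les.length) {
            return selectKth(les, k);
        }
        // Elements strictly greater than the pivot.
        int[] grt = Arrays.stream(s).filter(e -> e > pivot).toArray();
        int notGreater = s.length - grt.length;
        if (k >= notGreater) {
            return selectKth(grt, k - notGreater);
        }
        return pivot; // rank k falls among the elements equal to the pivot
    }
}
```

As in the Nesl version, each recursive call allocates a new, shorter vector, which is why this benchmark places more emphasis on recursion and dynamic memory use than line-fit does.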

A.3 Sparse matrix-vector multiplication

function nest(p, mlen) =
let vector(seg, vals) = p;
    (seg1, seg2)      = mlen
in vector(seg1, vector(seg2, vals));

function MxV(Mval, Midx, Mlen, Vect) =
let v = Vect -> Midx;
    p = {a * b: a in Mval; b in v}
in
    {sum(row) : row in nest(p, Mlen)};

For  the  sparse  matrix-vector  multiplication  benchmarks,  every  row  in  the  matrix  had  a  length  of  5 
and  the  matrix  values  were  random  floating-point  data. 
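
The data layout used by MxV can be illustrated with a sequential Java sketch: Mval holds the nonzero values row by row, Midx the column index of each value, and Mlen the number of nonzeros in each row. The code below is hand-written for comparison, not translator output, and the class name SparseMxVSketch is an assumption.

```java
/** Sequential Java rendering of the Nesl MxV function (illustrative only). */
final class SparseMxVSketch {
    /**
     * Multiplies a sparse matrix by a dense vector.
     * mval: nonzero values, row by row; midx: column index of each value;
     * mlen: number of nonzeros in each row; vect: the dense vector.
     */
    static double[] mxv(double[] mval, int[] midx, int[] mlen, double[] vect) {
        double[] result = new double[mlen.length];
        int j = 0; // position in mval and midx
        for (int row = 0; row < mlen.length; row++) {
            double sum = 0.0;
            for (int end = j + mlen[row]; j < end; j++) {
                sum += mval[j] * vect[midx[j]]; // gather from vect, then multiply
            }
            result[row] = sum;
        }
        return result;
    }
}
```

For example, the 2 × 3 matrix with rows (1, 0, 2) and (0, 3, 0) is stored as mval = {1, 2, 3}, midx = {0, 2, 1}, mlen = {2, 1}; multiplying it by (1, 2, 3) gives (7, 6).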